Abstract



Display advertising has been a significant source of revenue for publishers and ad networks in the online advertising
ecosystem. One of the main goals in display advertising is to maximize user response rate for advertising campaigns,
such as click through rates (CTR) or conversion rates. Although in the online advertising industry we believe that 
the visual appearance of ads (creatives) matters for propensity of user response, there is no published work so far to 
address this topic via a systematic data-driven approach. In this paper we quantitatively study the relationship between 
the visual appearance and performance of creatives using large scale data in the world's largest display ads exchange 
system, RightMedia. We designed a set of 43 visual features, some of which are novel and some are inspired by 
related work. We extracted these features from real creatives served on RightMedia. We also designed and conducted 
a series of experiments to evaluate the effectiveness of visual features for CTR prediction, ranking and performance 
classification. Based on the evaluation results, we selected a subset of features that have the most important impact 
on CTR. We believe that the findings presented in this paper will be very useful for the online advertising industry 
in designing high-performance creatives. It also provides the research community with the first ever data set, initial 
insights into visual appearance's effect on user response propensity, and evaluation benchmarks for further study. 

1 Introduction 

The Internet revolution has transformed how people experience information, media and advertising. Web advertising, 
although nonexistent twenty years ago, has become a vital component of the modern Internet, where advertisements
are delivered from advertisers to users through different online channels. Recent trends have shown that an increasingly
large share of advertisers' budgets is devoted to the online world, and online advertising spending has greatly outpaced
some of the traditional advertising media, such as radio and magazines. Display advertising is one type of online
advertising which, together with search advertising, contributes the majority of the revenue for many large Internet 
companies. In display advertising, display ad instances are shown to the user on webpages in different formats such as 
image, flash, and video. Each display ad instance is called a creative. By showing the creatives, advertisers aim to either 
promote brand awareness among users (brand advertising) or receive desirable responses from users (performance 
advertising), such as the action of purchasing, clicking or signing up for a promotion list from the advertiser's website. 
In performance advertising, the advertiser strives to optimize their ad's performance metrics such as the effective 
cost per click (eCPC) or effective cost per action (eCPA), which in turn relates to maximizing the user response rate 
on the creatives as measured by click through rates (CTR) or conversion rates (CVR). There are several factors that 
greatly influence the user response rate of display advertising campaigns: 1) the position of the ads on the webpage; 
2) the relevancy of the ads to the online users, which is generally captured by the targeting profiles of the advertising 
campaigns; 3) the relevancy of the ads to the webpage content and 4) the quality and visual appearance of the creatives. 

The problem of predicting the user response rate for online ads, especially CTR, has been studied by several re- 
searchers in the last few years. One major research focus has been in predicting clicks by studying the relationship 
between CTR and the aforementioned ad factors (and their combinations). For example, in [2] the authors considered
the ad's relevancy to the content of the webpage in predicting CTR. They show that improving the ad's content
relevancy is more efficient than considering the content of ads by themselves [25]. Although it is generally believed
that visually appealing ads can perform better in attracting online users, as a result of which advertisers always care 
about the creative designs, there is no, to the best of our knowledge, published work so far to quantitatively study the 
effect of visual appearance of creatives on campaign performance in online display advertising. This motivates us to 
investigate the correlation between the visual features of the creative and CTR, regardless of other ad factors, and to 
predict creative performance based on its visual appearance alone. 

Our proposed approach consists of two main steps, 1) feature extraction and 2) correlation investigation. We 
first extract some informative visual features from the creatives. We introduce 43 visual features classified into three 
categories, 1) global features which characterize the overall properties of a given creative, 2) local features representing 
the properties of specific parts within a given creative and 3) advanced features which are a group of features developed 
based on more complicated algorithms such as the number of faces and number of characters in a creative. We then 
develop three regression approaches to predict the CTR based on these features. The study is conducted using real 
creatives and their performance data from the world's largest display ads exchange system, RightMedia. Based on 
the weights of developed features, we further select a subset of features that have high impact on the creative's CTR. 
The benefit of this work is three-fold. First, our findings on the visual features and their relationship to CTR can 
provide useful recommendations to designers on what features to consider while designing creatives, and/or can help 
in automated creative generation. Second, the visual features and the regression methods developed here can be used in 
addition to the traditionally investigated ad factors (such as ad relevancy, position etc.) for improving CTR prediction 
in online ads selection. Third, it provides the research community with the first ever data set, initial insights into the 
effect of visual appearance on user response propensity, and evaluation benchmarks for further study. 

The paper is organized as follows. Section 2 introduces the related work. We introduce the visual features in
Section 3. The regression and feature selection results for CTR prediction are presented in Section 4, followed by our
conclusion in Section 5.

2 Background and Related Work 

The relationship between various print ad characteristics and measures of advertising effectiveness has been studied 
by advertising researchers for almost a century. A wide variety of characteristics have been investigated. These 
characteristics are roughly in two categories: mechanical and content-based. The mechanical characteristics include ad 
size, number of colors, proportion of illustrations to copy, the absence of borders, and type size. The content factors
include message appeal like status, quality, fear and fantasy, attention-getting techniques like free offers, presence
of women, and psycholinguistic variables like product or personal reference in headline, interrogative or imperative
headline, and visual rhetoric, among others. See [20] for summaries.

Even though online advertising has taken a large market share of the advertising industry, and the whole industry
is steadily and continuously shifting to the online domain, research on the effectiveness of the online counterpart of print
ads, generally called display ads, is limited. We list the studies of several factors below.

Some existing studies investigate the effect of several different factors on the performance of display advertising
campaigns. These factors include targeting and obtrusiveness [9], advertisement size (large vs. small) and ad
exposure format (intrusive vs. voluntary) [4], cognitive impact from ad size and animation [18], emotional appeal and
incentive offering in the ads, and repetition of varied execution vs. single execution [30].

To the best of our knowledge, there is no prior study on the relationship between the visual appearance and
the performance of creatives in online display ads. We try to tackle this problem by first defining a set of visual features 
and then evaluating their effects on ad performance, specifically CTR in our experiments, from the actively served ad 
campaigns on the world's largest ad exchange system, RightMedia. Below, we present some previous computational 
studies on image properties which provide us inspiration in designing our visual features. 

There are several studies that investigate a specific property of images (photos or paintings) using computational
approaches. Such properties include quality and aesthetics in photos [19] [15] [29] [7] or in paintings [17],
saliency [14], composition [10] [24], color harmony [5] and memorability [13].

Initial work on image quality evaluation concentrated on evaluating and reconstructing low-grade, compressed or
degraded images using simple noise models [6] [1]. However, in most of the beauty evaluation work, including this paper,
we assume that high quality images are available and we are interested in evaluating the visual aesthetic of images 
based on visual features. 

Recently some researchers tried to evaluate the beauty of an image based on its visual features. In [13] the authors
aim to classify pictures into professional and snapshot photos using some basic features including the spatial distribution
of edges, color distribution and hue count. Another study introduced a regression-based approach for
rating photos based on their beauty, using features such as average pixel density, colorfulness, saturation, hue, and the
rule of thirds. In addition to these studies, in [19] the authors proposed an approach to classifying images into high and
low quality. The main idea comes from the fact that a professional photographer makes the background blurry and the 
subject distinguishable in the image. By separating the blurry part of the image from the subject, they design a set of 
well-motivated features from both the subject and the whole image such as the clarity contrast of the subject, lighting, 
simplicity, color harmony and composition geometry. They show that the combination of these features can provide a 
promising performance. All of the above work tries to extract visual features from photos. Recently Li et al. [17] tried
to extract some features from paintings to evaluate their beauty and classify them into high and low quality. They
introduced a set of global and local features, 40 in total, to capture the painting properties such as the brightness contrast
between segments, the brightness contrast across the whole image and the average saturation for the largest segment 
of the image. 

Computational approaches have also been used to investigate other visual properties of an image. In [13] the
authors studied what properties of images make them more memorable. They found that statistical properties of an 
image such as mean hue, mean saturation, intensity mean, intensity variance, intensity skewness and number of objects 
do not have any non-trivial correlation with memorability in their generated data set. However, they found that if they 
label the objects and scenes in the images, they can find a non-trivial and interesting correlation between images and 
their memorability. For example, their results show that the presence of human beings, close-up objects and human-scale
objects in an image improves its memorability more than natural scenes do. This result cannot be applied to
our work since it requires a large amount of supervision to tag different parts of the images. However, we evaluate the
impact of the number of human faces in an image in our work. 

Color harmonization is another approach for making an image more appealing. In [5] the authors proposed to
harmonize the colors in a given image using harmonization templates from [28], which include 8 different harmonized
color templates. We also used color harmony models to evaluate the hue distribution of an image in our experiments. 

In summary, existing work in related areas has focused primarily on properties of an image, photo or painting. 
In contrast, we examine creatives in online display ads, which contain both graphical features and text. In addition, 
some of the existing approaches require significant amount of supervision in their feature extraction step, which is not 
possible in large scale applications where we need to learn from large data sets with minimum amount of supervision. 
Finally, we would also like to extract a set of features that are visually understandable and can be practically con- 
trolled to guide the human designers or automatic creative generators (like in smart ads) to produce high-performance 
creatives. These objectives make our problem novel and interesting for the online advertising industry. 

3 Feature Extraction 

In this section we introduce a set of 43 different visual features. We categorize the developed features into three 
different sets, 1) global features, 2) local features and 3) advanced features. A complete list of the features can be 
found in Table 3. Below we describe the detailed definition of the proposed features in each category.

In the following sections we use $I$ to indicate an image and use $|I|$ to indicate the size of the image measured by the
number of pixels. We use the variable $x$ to denote an arbitrary pixel when we do not care about its location in the image.
Otherwise we use $(i, j)$ to denote the pixel in the $i$-th row and $j$-th column of the image.

3.1 Global Features 

Global features are a set of features which represent the overall properties of the whole image. We describe the details 
of 19 different global features in this section. 

3.1.1 Gray Level Features 

We describe 3 features extracted from the gray level histogram of the image, namely the gray level contrast $f_1$, the number
of dominant gray level bins $f_2$, and the standard deviation of the gray level values among all pixels $f_3$.

The gray level contrast is the width of the middle 95% mass in the gray level histogram [15]. From the original
gray level histogram, we prune the extreme 2.5% from the 0 side and 2.5% from the 255 side. The gray level contrast
feature $f_1$ is calculated as the width of the remaining histogram.

We count the number of dominant bins in the gray level histogram as our second feature. Suppose the set
$G = \{g_0, g_1, \ldots, g_{255}\}$ indicates the set of 256 bins in the gray level histogram such that $g_i$ is the number of pixels in the $i$-th
bin. We define the number of dominant gray level bins as $f_2 = \sum_{k=0}^{255} \mathbf{1}(g_k \ge c_1 \max_i g_i)$, where $\mathbf{1}(\cdot)$ is the indicator
function and $c_1$ is a threshold value which is set to 0.01 in this paper.¹

The last gray level feature, $f_3$, is defined as the standard deviation of the gray level values of all pixels in the image. It
is used to capture the variance of the gray level distribution.
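
The three gray level features can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `gray_level_features`, the flat-list input format and the use of the population standard deviation are our choices, not details given in the paper.

```python
from statistics import pstdev

def gray_level_features(gray, c1=0.01, trim=0.025):
    """gray: flat list of integer gray levels in 0..255 (assumed input format)."""
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    cut = trim * len(gray)

    # f1: width of the histogram once `trim` of the mass is pruned from each side.
    acc, lo = 0, 0
    for level in range(256):
        acc += hist[level]
        if acc > cut:
            lo = level
            break
    acc, hi = 0, 255
    for level in range(255, -1, -1):
        acc += hist[level]
        if acc > cut:
            hi = level
            break
    f1 = hi - lo

    # f2: number of bins holding at least c1 times the mass of the largest bin.
    peak = max(hist)
    f2 = sum(1 for g in hist if g >= c1 * peak)

    # f3: standard deviation of the gray levels (population form is an assumption).
    f3 = pstdev(gray)
    return f1, f2, f3
```

For an image whose gray levels are spread uniformly over 0..255, the sketch prunes a few levels from each end and counts every bin as dominant.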

3.1.2 Color Distribution 

To avoid distraction from objects in the background, professional photographers tend to keep the background simple. 
In [19], the authors use the color distribution of the background to measure this simplicity. We use a similar approach
to measure the simplicity of the color distribution in the image. For a given image, we quantize each RGB channel into
8 values, creating a histogram $\mathcal{H}_{rgb} = \{h_0, h_1, \ldots, h_{511}\}$ of 512 bins, where $h_i$ indicates the number of pixels in
the $i$-th bin. We define feature $f_4$ to indicate the number of dominant colors as $f_4 = \sum_{k=0}^{511} \mathbf{1}(h_k \ge c_2 \max_i h_i)$, where
$c_2 = 0.01$ is the threshold parameter. We also calculate the size of the dominant bin relative to the image size as
$f_5 = \frac{\max_i h_i}{|I|}$. This feature indicates the extent to which one of the 512 colors is dominant in the image.

By replacing the RGB color map with the HSV (Hue, Saturation, Value) color map and using the above methods for
calculating features $f_4$ and $f_5$, we obtain two other features, $f_6$ and $f_7$.
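
As an illustration of the quantization step, the following sketch computes $f_4$ and $f_5$ for the RGB case. The function name, the tuple-based pixel format and the assumption of 8-bit channels are ours.

```python
def dominant_color_features(pixels, c2=0.01):
    """pixels: list of (r, g, b) tuples with 8-bit channels (assumed format)."""
    hist = [0] * 512
    for r, g, b in pixels:
        # Each channel is quantized into 8 values: 256 / 8 = 32 levels per value.
        idx = (r // 32) * 64 + (g // 32) * 8 + (b // 32)
        hist[idx] += 1
    peak = max(hist)
    f4 = sum(1 for h in hist if h >= c2 * peak)  # number of dominant colors
    f5 = peak / len(pixels)                      # relative size of the largest bin
    return f4, f5
```

The HSV variants $f_6$ and $f_7$ would follow the same pattern after converting each pixel to HSV and quantizing those channels instead.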

3.1.3 Model-Based Color Harmony 

The concept of color harmony in this paper is based on 8 different harmonic color distributions (illustrated in Figure
1) that are defined on the hue wheel of the HSV color space. These distributions are called i, V, L, I, T, Y, X and N. Note that
each distribution can be rotated by $0 \le \alpha < 360$ degrees. The sizes of the sectors in the color harmony distributions are set as
follows: the large sectors of types V, Y and X are 26% of the disk (93.6°); the small sectors of types i, L, I and Y are
5% of the disk (18°); the largest sector of type L is 22% of the disk (79.2°); the sector of type T is 50% of the disk
(180°). The angle between the centers of the two sectors is 180° for I, X, Y, and 90° for L.



Figure 1: Color harmony models. Panels: i type, V type, L type, I type (top row); T type, Y type, X type, N type (bottom row).

Let us define the set of 8 distributions as $\mathcal{D} = \{d^1, d^2, \ldots, d^8\}$. We say $\phi(d^i_\alpha, x)$ indicates the hue of the closest
point in the $i$-th distribution to $x$ after a rotation of $\alpha$ degrees, where $x$ is any arbitrary pixel in the image. We compute the
distance between the hue distribution of our image $I$ and the distribution $d^i \in \mathcal{D}$ as:

$$\gamma(I, d^i) = \min_{\alpha} \frac{1}{|I|} \sum_{x \in I} \| \mathrm{hue}(x) - \phi(d^i_\alpha, x) \| \cdot \mathrm{sat}(x), \quad (1)$$

where $\mathrm{hue}(x)$ and $\mathrm{sat}(x)$ indicate the hue and saturation at pixel $x$, and $\| \cdot \|$ denotes the arc-length distance. We
are interested in the best fitting model $d^*$, which has the least $\gamma(\cdot)$ value: $d^* = \arg\min_{d^i} \gamma(I, d^i)$. We define feature
$f_8 = \gamma(I, d^*)$. Intuitively, it tells us how different the hue distribution of image $I$ is from the best fitting model of color
harmony.

Some models in Figure 1 are supersets of other models, so for any image $I$ the $\gamma(\cdot)$ value of a smaller model is
no lower than that of a larger model that contains it, e.g. $\gamma(I, d_i) \ge \gamma(I, d_Y) \ge \gamma(I, d_X)$. Therefore, if the
hue distribution of an image fits into one of the small models (types i, V, L, I), it fits into larger models as well. This can emphasize the
color harmony property of images which fit into a few models rather than just one model. We consider this
property as one potential positive property of the image. To quantify this property, we introduce a new feature, $f_9$,
which indicates the average color harmony deviation from the best two fitted models given an image $I$. In general, in
addition to the deviation from the best fitted model captured by feature $f_8$, we consider the deviation from the second
best fitted model as well, and the average of these two deviations is returned as $f_9$. Clearly, for images fitting into
small color harmony models, $f_8$ and $f_9$ will be very close to each other. However, for images which fit only into
the largest model, $f_9$ will be considerably larger than $f_8$. We believe these two numerical features can represent
the color harmony property of an image appropriately.

¹ This parameter, and similar ones in the rest of the paper, is set following related works such as [19].
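
A rough sketch of how $f_8$ and $f_9$ could be computed is given below. It rests on several assumptions of ours: each template is encoded as a list of (center offset, width) sectors in degrees using the sizes listed above, the grayscale N type is omitted because it has no hue sector, rotations are searched in 1° steps, and the deviation is averaged over pixels. The names `TEMPLATES`, `harmony_deviation` and `harmony_features` are hypothetical.

```python
# Hypothetical sector encoding: (center offset in degrees, sector width in degrees).
TEMPLATES = {
    "i": [(0, 18.0)],
    "V": [(0, 93.6)],
    "L": [(0, 18.0), (90, 79.2)],
    "I": [(0, 18.0), (180, 18.0)],
    "T": [(0, 180.0)],
    "Y": [(0, 93.6), (180, 18.0)],
    "X": [(0, 93.6), (180, 93.6)],
}

def arc_dist(a, b):
    """Arc-length distance between two hue angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def dist_to_sector(hue, center, width):
    """Distance from a hue to the nearest point of a sector (0 if inside)."""
    return max(0.0, arc_dist(hue, center) - width / 2.0)

def harmony_deviation(pixels, sectors, step=1):
    """pixels: list of (hue_deg, sat). Minimum over rotations of the mean
    saturation-weighted distance to the rotated template."""
    best = float("inf")
    for alpha in range(0, 360, step):
        total = 0.0
        for hue, sat in pixels:
            total += sat * min(
                dist_to_sector(hue, (c + alpha) % 360, w) for c, w in sectors
            )
        best = min(best, total / len(pixels))
    return best

def harmony_features(pixels):
    """Return (f8, f9): best deviation and mean of the two best deviations."""
    devs = sorted(harmony_deviation(pixels, s) for s in TEMPLATES.values())
    return devs[0], (devs[0] + devs[1]) / 2.0
```

With two hue clusters on opposite sides of the wheel, the single-sector i type cannot cover both, while the I, Y and X types can, so both features come out as zero for such an input.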

3.1.4 Color Coherence 

We extract a set of features based on the color coherence of pixels, which results in connected coherent components [23]. A
connected coherent component in an image is defined as: 

• A set of pixels that fall into the same bin in the histogram. 

• For any two pixels $p_i$ and $p_j$ in a connected coherent component $P = \{p_1, p_2, \ldots, p_m\}$ of $m$ pixels, there is a
path of sequential pixels, $p_i, p_{i+1}, \ldots, p_j$. Two sequential pixels in a path must be within the 8-neighborhood
of each other.

• The size of the connected coherent component is larger than a predefined threshold $c_4$. In our experiment we set
$c_4 = 0.01|I|$.

We denote the set of connected coherent components and their color indices as $\mathcal{P} = \{(P_1, h_1), (P_2, h_2), \ldots, (P_n, h_n)\}$,
where $P_i$ is the set of pixels in the $i$-th component, and $h_i$ is its corresponding color in the HSV color histogram with
512 bins. We use $|P_j|$ to denote the number of pixels in $P_j$. We extract the following features based on the above
definition: 

• $f_{10} = n$, which indicates the number of connected coherent components in the image.

• $f_{11} = \frac{\max_i |P_i|}{|I|}$, which indicates the size of the largest component relative to the whole image.

• $f_{12} = \frac{\max_{j \ne \arg\max_i |P_i|} |P_j|}{|I|}$, representing the size of the second largest connected coherent component relative to
the whole image.

• $f_{13} = \mathrm{rank}(h_i)$, $i = \arg\max_j |P_j|$, indicating the rank of the bin, considering the bin size in descending order,
associated with the largest connected coherent component in the image. For example, the value of this feature
is 1 if the bin associated with the largest coherent component, $\arg\max_j |P_j|$, is the largest bin in the color
histogram as well, i.e. the bin attaining $\max_i |h_i|$, where $|h_i|$ is the size of the $i$-th bin in the color histogram. This feature indicates
how the colors are dispersed in the image. We expect $f_{13} = 1$ if the colors in the image are not
distributed very randomly, which means that pixels with the same color are mostly connected together.

• $f_{14} = \mathrm{rank}(h_i)$, $i = \arg\max_{j \ne \arg\max_k |P_k|} |P_j|$. Similar to $f_{13}$, it shows the bin rank, considering the bin size
in descending order, of the second largest connected coherent component in the image.
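
The component definition above can be illustrated with a breadth-first flood fill over the 8-neighborhood. This is a sketch, not the authors' implementation: the input is assumed to be a 2D list of precomputed histogram bin indices, and the function names are ours. Only $f_{10}$ to $f_{12}$ are shown; the rank features would additionally need the bin sizes of the full color histogram.

```python
from collections import deque

def coherent_components(bins, c4_ratio=0.01):
    """bins: 2D list of color-bin indices, one per pixel (assumed input).
    Returns (size, bin) pairs for components larger than c4_ratio * |I|,
    sorted from largest to smallest."""
    rows, cols = len(bins), len(bins[0])
    seen = [[False] * cols for _ in range(rows)]
    min_size = c4_ratio * rows * cols
    comps = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            seen[r][c] = True
            queue, size = deque([(r, c)]), 0
            while queue:
                y, x = queue.popleft()
                size += 1
                # Visit the 8 neighbors that share the same bin.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx]
                                and bins[ny][nx] == bins[r][c]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if size > min_size:
                comps.append((size, bins[r][c]))
    return sorted(comps, reverse=True)

def coherence_features(bins):
    """Return (f10, f11, f12) for a non-empty image."""
    comps = coherent_components(bins)
    total = len(bins) * len(bins[0])
    f10 = len(comps)
    f11 = comps[0][0] / total
    f12 = comps[1][0] / total if len(comps) > 1 else 0.0
    return f10, f11, f12
```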

3.1.5 Hue Distribution 

In this section we introduce three features based on the hue in HSV color space. We quantize hues in an image in 
a similar way as in [17] by eliminating the pixels with saturation and value less than 0.2. This eliminates all the
pixels with white or black colors. Then we calculate the hue histogram of the remaining pixels with 20 different bins, 18°
for each bin, which results in $\mathcal{H}_{hue} = \{h_1, h_2, \ldots, h_{20}\}$ where $h_i$ indicates the set of pixels in the $i$-th bin. We then extract
the following features:

• $f_{15} = \sum_{i=1}^{20} \mathbf{1}(|h_i| \ge c_5 |I|)$ where $c_5 = 0.01$ in our experiments. This feature indicates the number of dominant
hues in an image.

• $f_{16} = \max_{i,j} \| h_i - h_j \|$ where $|h_i|, |h_j| \ge c_5 |I|$, and $\| \cdot \|$ is the arc length. This feature indicates the largest
contrast between two dominant hues in the image.

• $f_{17} = \mathrm{std}(\Phi)$ where $\Phi = \{ \| \mathrm{hue}(x) - 0 \| : x \in I \}$ and $\| \cdot \|$ is the arc length. This feature indicates the
standard deviation of the distances of all pixels' hues from the origin 0. It measures how widely the hues
in an image are spread.
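
The first two hue features can be sketched as follows. Two points are assumptions on our part: a pixel is dropped when either its saturation or its value is below 0.2 (so that both near-white and near-black pixels are removed), and the hue of a bin is represented by its center when computing the arc-length contrast. The function name and input format are hypothetical.

```python
def hue_features(pixels, c5=0.01, n_bins=20):
    """pixels: list of (hue_deg, sat, val) with sat and val in [0, 1]."""
    width = 360.0 / n_bins
    hist = [0] * n_bins
    for hue, sat, val in pixels:
        if sat < 0.2 or val < 0.2:  # drop near-white and near-black pixels
            continue
        hist[int(hue % 360 // width)] += 1

    # Bins holding at least c5 * |I| pixels are the dominant hues.
    dominant = [i for i, h in enumerate(hist) if h >= c5 * len(pixels)]
    f15 = len(dominant)

    # f16: largest arc-length distance between the centers of two dominant bins.
    f16 = 0.0
    for i in dominant:
        for j in dominant:
            d = abs(i - j) * width
            f16 = max(f16, min(d, 360.0 - d))
    return f15, f16
```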
3.1.6 Lightness Features

We use the lightness $L$ in the HSL color space to calculate features $f_{18}$ and $f_{19}$. In the HSL color space, the $L$ value is
large when the color is white and small when the color is black. The $L$ value in HSL color space can be calculated
as follows:

$$L(x) = \frac{\max(r(x), g(x), b(x)) + \min(r(x), g(x), b(x))}{2}, \quad (2)$$

where $r(x)$, $g(x)$, $b(x)$ denote the R, G, B values of pixel $x$ in RGB color space. We calculate two lightness features
as:

• $f_{18} = \frac{1}{|I|} \sum_{x \in I} L(x)$, the average lightness of pixels in the image.

• $f_{19} = \mathrm{std}(L(\cdot))$, the standard deviation of the lightness of all pixels in the image.
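
A minimal sketch of Equation (2) and the two lightness features follows. The scaling of 8-bit channels to [0, 1] and the use of the population standard deviation are assumptions; the function names are ours.

```python
from statistics import mean, pstdev

def lightness(r, g, b):
    """HSL lightness of one pixel; channels are expected in [0, 1]."""
    return (max(r, g, b) + min(r, g, b)) / 2.0

def lightness_features(pixels):
    """pixels: list of (r, g, b) tuples with 8-bit channels (assumed format)."""
    values = [lightness(r / 255, g / 255, b / 255) for r, g, b in pixels]
    return mean(values), pstdev(values)  # (f18, f19)
```

Under this definition a pure white pixel has lightness 1 and a pure black pixel has lightness 0.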

3.2 Local Features 

Local features represent a set of features extracted from specific parts of the image rather than the whole image. 
We apply the normalized cut segmentation method [26] to partition the image into 5 smaller segments. Let
$S = \{S_1, S_2, \ldots, S_5\}$ indicate the set of 5 different segments where $S_i$ is the set of pixels in segment $i$. Note that a
segment is considered as noise and is dropped if it is smaller than 5% of the image. We develop the following features
based on the segmentation result. 

3.2.1 Segment Size 

Two features are extracted from segment size as follows: 

• $f_{20} = \frac{\max_i |S_i|}{|I|}$, indicating the size of the largest segment relative to the whole image.

• $f_{21} = \frac{1}{|I|} \max_{i,j} \big| |S_i| - |S_j| \big|$, indicating the contrast among the segment sizes of the image.
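
Given the pixel counts of the segments, the two size features reduce to a few lines. The sketch below assumes the segmentation has already been computed elsewhere and that the 5% noise filter is applied before the features are taken; the function name is hypothetical.

```python
def segment_size_features(segment_sizes, image_size, min_ratio=0.05):
    """segment_sizes: pixel counts per segment; image_size: |I| in pixels."""
    # Segments smaller than 5% of the image are treated as noise and dropped.
    kept = [s for s in segment_sizes if s >= min_ratio * image_size]
    f20 = max(kept) / image_size
    # The largest pairwise size difference is max minus min.
    f21 = (max(kept) - min(kept)) / image_size
    return f20, f21
```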

3.2.2 Segment Hues 



Similar to Section 3.1.5, we generate the hue histogram of each segment. We define the set of hue histograms of all 5
segments as $\mathcal{H}^s_{hue} = \{h_{1,1}, h_{1,2}, \ldots, h_{1,20}, h_{2,1}, \ldots, h_{5,20}\}$ where $h_{i,j}$ indicates the set of pixels that fall in the $j$-th
bin of the $i$-th segment. Then we extract six features to capture different hue properties. Below we give the formal
definitions of these features:

• $f_{22} = \sum_{j=1}^{20} \mathbf{1}(|h_{i,j}| \ge c_6 |I|)$ where $i = \arg\max_k |S_k|$ and $c_6 = 0.01$. This feature denotes the number of
image-wide dominant hues in the largest segment. In general, we would like to have most of the image hues in
the largest segment.

• $f_{23} = \sum_{j=1}^{20} \mathbf{1}(|h_{i,j}| \ge c_6 |S_i|)$ where $i = \arg\max_k |S_k|$. This feature denotes the number of segment-wide
dominant hues in the largest segment.

• $f_{24} = \max_i l_i$ where $l_i = \sum_{j=1}^{20} \mathbf{1}(|h_{i,j}| \ge c_6 |S_i|)$ is the number of dominant hues in the $i$-th segment. This
feature essentially denotes the largest number of dominant hues in one segment. We would like this feature to have the same
value as $f_{23}$, indicating that the largest segment has the largest number of dominant colors.

• $f_{25} = \max_{i,j} |l_i - l_j|$. This feature denotes the contrast of the number of dominant hues among the segments. We
usually do not like to have lots of different hues in one segment and a few hues in another segment in an image.
We expect unappealing images to have a large value of $f_{25}$.

• $f_{26} = \max_{j,k} \| h_{i,j} - h_{i,k} \|$ where $|h_{i,j}|, |h_{i,k}| \ge c_6 |S_i|$, $i = \arg\max_m |S_m|$ and $\| \cdot \|$ is the arc length distance.
This feature captures the contrast of number of pixels among the hue bins in the largest segment. In general, we
expect to have an appealing image with one bin dominating the largest segment in addition to a few more small
bins. This makes the contrast value very large.

• $f_{27} = \mathrm{std}(T(\cdot))$ where $T(i) = \max_{j,k} \| h_{i,j} - h_{i,k} \|$, $|h_{i,j}|, |h_{i,k}| \ge c_6 |S_i|$. This feature returns the standard deviation of
the hue contrast among the segments. If we have different hue contrasts among different segments, this feature will
take a large value.
Figure 2: The saliency map of an image. Left: original image. Right: saliency map.



3.2.3 Segment Color Harmony 

Two features are extracted based on the color harmony of the largest segment. Feature $f_{28}$ is the minimum deviation from the
best fitted color harmony model for the largest segment, and feature $f_{29}$ is the average deviation of the best two fitted
color harmony models for the largest segment. The details of the color harmony models have been introduced in Section 3.1.3.

3.2.4 Segment Lightness 

Three segment lightness features are extracted using a similar method as in Section 3.1.6.

• $f_{30}$: average lightness in the largest segment.

• $f_{31}$: standard deviation of average lightness among the segments.

• $f_{32}$: contrast of average lightness among the segments.

3.3 Advanced features 

In this section we develop a set of features based on more complicated algorithms. Most of the advanced features are 
based on the saliency map of the image which determines the visually salient areas in the image that are more likely to 
be noticed by the humans. We also extract two additional features related to the number of characters and number of 
faces in an image. Below we describe the details of these features. 

3.3.1 Saliency Features 

Saliency is a well-known phenomenon in human vision where attention tends to be drawn to interesting
parts of an image that appear visually different from the rest of the image (e.g., a red coke can on a green background
appears salient and is immediately noticed, while the same coke can on an orange-reddish background is not salient
and is less likely to be noticed). We compute saliency according to the algorithm described in [12]. Figure 2 shows the
saliency output of the algorithm presented in [12] for a sample creative. The areas with higher lightness in the saliency
map indicate the more salient parts of the image.

The saliency algorithm returns a matrix $\tau$ (also referred to as the saliency map) where $\tau(i, j)$ represents the saliency
value of pixel $(i, j)$. We also extract a binary image based on the saliency map, by applying a threshold $\alpha$ to the saliency
map where the pixels with saliency value larger than $\alpha$ are set to 1 and the rest of the pixels are set to 0. Similar to [12],
the parameter $\alpha$ is set as $\alpha = 3\bar{\tau}$ where $\bar{\tau} = \frac{1}{|I|} \sum_{i,j} \tau(i, j)$ is the average saliency value in the image. After this
binarization, we have some connected components with value 1. These components indicate salient areas, and the
other parts of the image are considered as background. Then we extract the following features based on the saliency
results, i.e. the saliency map and the binary saliency map.

• $f_{33}$: background size. Salient objects usually appear in the foreground and not in the background. Therefore we
return the size of the background relative to the image size, which is calculated as $f_{33} = \frac{1}{|I|} \sum_{i,j} \mathbf{1}(\tau(i, j) < \alpha)$.

• $f_{34}$: number of connected components in the binary map.

• $f_{35}$: size of the largest component in the binary saliency map relative to the whole image.
Figure 3: The four interest points based on the rule of thirds.

• $f_{36}$: average saliency value of the largest component in the binary saliency map.

• $f_{37}$: number of connected components in the image background. In some images, the salient areas can divide
the background into several disconnected segments. Usually it is not desirable to have multiple background
components.

• $f_{38}$: size of the largest connected component in the background relative to the whole image. If the number of
connected components in the background is equal to one, then this feature has the same value as $f_{33}$.

• $f_{39}$: distance between connected components. Let the set $C = \{c_1, c_2, \ldots, c_n\}$, $c_i = (x_i, y_i)$, indicate the set
of $n$ different points such that each $c_i$ indicates a pixel corresponding to the center of mass of the $i$-th salient
area. To make the rest of the computation independent of the image size, we rescale
each point $c_i$ as $s_i = (x_i / I_x, y_i / I_y)$, where $I_x$ and $I_y$ are the horizontal and vertical sizes of the image. Then
we build a complete weighted graph on the set $C$ such that the weight $w_{ij}$ between two vertices $c_i$, $c_j$ is
calculated as $w_{ij} = \| s_i - s_j \|_2$. Then we return the sum of all edge weights as the distance between
connected components.

• $f_{40}$: distance from the rule of thirds points. Professional photographers usually locate their main object at one of
the four interest points based on the rule of thirds. The four interest points are the intersections of the two
vertical and two horizontal lines dividing the image into 9 equal segments. Figure 3 shows the four interest
points based on the rule of thirds. This is an important feature in photo beauty evaluation [17], motivating us to
investigate its effect on creative performance. We define this feature as the minimum distance from the center of
mass of the largest salient area to one of the four interest points based on the rule of thirds.

• $f_{41}$: distance from the center of the image. This feature is the distance of the saliency components to the center of the image,
which is the most focused part of an image. The overall distance from the centers of all connected components to
the center of the image is returned as feature $f_{41}$. Note that for both features $f_{40}$ and $f_{41}$, we normalize the position
of each pixel in the same way as for feature $f_{39}$.
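
Several of the saliency features above depend only on the saliency map and on component centroids, so they can be sketched without the saliency algorithm itself. In the sketch below the saliency map is assumed to be given as a 2D list, centroids are assumed to be precomputed, and all function names are hypothetical.

```python
import math

def binarize_saliency(tau):
    """tau: 2D list of saliency values. Returns (alpha, mask, f33), where
    alpha is three times the mean saliency and f33 is the background share."""
    n = len(tau) * len(tau[0])
    alpha = 3.0 * sum(map(sum, tau)) / n
    mask = [[1 if v > alpha else 0 for v in row] for row in tau]
    f33 = sum(row.count(0) for row in mask) / n
    return alpha, mask, f33

def pairwise_distance_sum(centroids, width, height):
    """f39: sum of Euclidean distances between all pairs of normalized centroids."""
    pts = [(x / width, y / height) for x, y in centroids]
    return sum(
        math.hypot(pts[i][0] - pts[j][0], pts[i][1] - pts[j][1])
        for i in range(len(pts))
        for j in range(i + 1, len(pts))
    )

def rule_of_thirds_distance(cx, cy):
    """f40: distance from a normalized centroid to the nearest interest point."""
    points = [(x, y) for x in (1 / 3, 2 / 3) for y in (1 / 3, 2 / 3)]
    return min(math.hypot(cx - x, cy - y) for x, y in points)
```

The component counts and sizes ($f_{34}$ to $f_{38}$) would additionally require a connected-component labeling of the binary mask and of its complement.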

3.3.2 Number of Characters 

We consider the number of characters in an image as feature $f_{42}$. We tried several OCR toolboxes, and one of them 
gave appropriate results for the number of characters in ads [22]. Note that we are interested in 
the number of characters in the image regardless of their meaning. To evaluate the accuracy of the OCR toolbox, we 
counted the true number of characters in 100 random images and compared it with the number of characters returned 
by the OCR toolbox. We found a strong linear correlation of 0.80, suggesting that the toolbox is reasonably accurate 
in estimating the number of characters in images. Note that extracting the exact text from ad creatives is challenging 
because the text often appears in different fonts, sizes and orientations. 
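
The accuracy check described above amounts to a Pearson correlation between hand-counted and toolbox-returned counts. A minimal sketch follows; the counts are made-up illustrative numbers, not data from our experiments.

```python
import math

def pearson(xs, ys):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical counts for five images: manual count vs. OCR toolbox output.
true_counts = [12, 30, 7, 45, 20]
ocr_counts = [10, 33, 9, 40, 22]
r = pearson(true_counts, ocr_counts)
```

The same function applies unchanged to any other comparison between a toolbox output and manually counted ground truth.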

3.3.3 Number of Faces 

The last feature, $f_{43}$, captures the effect of human faces on creative performance. In [13] the authors 
concluded that human appearance in an image can make the image more memorable. This motivates us to test 
whether the appearance of faces affects creative performance. We count the number of faces in an image using an available 
toolbox [16]. In our experiments with a sample of 100 images, the face counts returned by the toolbox have a correlation 
of more than 0.9 with the true number of faces. 

Figure 4: The CTR distribution of two data sets. (a) Data set ID2. (b) Data set ID6. 



4 Experimental Results 



In this section we present the algorithms and experiments we designed to evaluate the relationship between visual 
features and the performance of creatives in online display advertising. 



4.1 Data Set 

We extracted creatives of advertising campaigns from the world's largest online advertising exchange system, RightMedia. 
We filtered out animated creatives because our features are designed for static images. We also calculated the 
average CTR of these creatives from the online serving history log over a two-month period. 

As discussed in Section 1, the performance of creatives is determined by many factors. One important factor is the 
ad position on the webpage. Generally, the positions available to a creative on a webpage are determined by the creative's 
size. To remove the impact of ad position (and size) on performance, we create two data sets, each 
of which consists of creatives of the same size. The first data set, ID2, consists of 6272 creatives of size 250 × 300 
pixels, and the second data set, ID6, includes 3888 creatives of size 90 × 730 pixels. All of the creatives have a minimum 
of 100K impressions, so that their CTRs can be assumed to have converged to their true values. The CTR distribution of each 
data set is shown in Figure 4. 

We further created two sub-categories from data set ID2: "dating" with 927 images and "traveling" with 599 
images. Since there are not many images in these two categories, we lower the minimum number of impressions to 20K and 
10K for "dating" and "traveling", respectively. 
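
The data set construction described above can be sketched as a simple filter. The record layout (`Creative` and its fields) and the example values are hypothetical; only the filtering rules (static creatives, one fixed size, a minimum number of impressions) come from the text.

```python
from dataclasses import dataclass

@dataclass
class Creative:
    # Hypothetical log record; field names are illustrative assumptions.
    creative_id: str
    width: int
    height: int
    animated: bool
    impressions: int
    clicks: int

def build_data_set(creatives, size, min_impressions):
    """Map creative id -> average CTR for static creatives of one fixed size
    that have at least `min_impressions` impressions."""
    return {
        c.creative_id: c.clicks / c.impressions
        for c in creatives
        if not c.animated
        and (c.width, c.height) == size
        and c.impressions >= min_impressions
    }

creatives = [
    Creative("a", 300, 250, False, 200_000, 400),  # kept
    Creative("b", 300, 250, True, 500_000, 900),   # dropped: animated
    Creative("c", 300, 250, False, 50_000, 100),   # dropped: too few impressions
    Creative("d", 728, 90, False, 150_000, 150),   # dropped: different size
]
data = build_data_set(creatives, size=(300, 250), min_impressions=100_000)
```

Lowering `min_impressions` for a small sub-category trades CTR stability for sample size, which is the compromise made for "dating" and "traveling".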



4.2 Learning Methods 

The main goal of this work is to study the relationship between the performance of creatives and their visual features. 
As a first step, we predict CTR from visual features using three regression 
algorithms: 1) Linear Regression (LR), 2) Support Vector Regression with an RBF kernel (SVR), and 3) 
Constrained Lasso (C-Lasso), which is a modification of Lasso [27]. 

We used LIBSVM to implement the SVR and performed cross-validation to determine the parameters of the 
model. We describe our constrained Lasso optimization approach as follows. Suppose we have a set of $n$ creatives at our 
disposal and the visual features of these creatives are represented as a matrix $A \in \mathbb{R}^{d \times n}$ such that $A = (a_1, a_2, \ldots, a_n)$, 
where $a_k \in \mathbb{R}^d$ is a column vector representing the $d$-dimensional visual features of creative $k$. In our experiments 
$d = 43$. The CTR values of the $n$ creatives are represented as a vector $y = (y_1, \ldots, y_n)^T \in \mathbb{R}^n$, where each $y_k$ is the 
CTR of the $k$-th creative. We bound the CTR of each creative by $y_{\min} \le y_i \le y_{\max}$, where $y_{\min}$ and $y_{\max}$ can be 
obtained from the online serving history log. To predict the CTR of the creatives, we solve the following optimization 
problem: 

$$\min_{w} \; \|A^T w - y\|_2^2 + \lambda \|w\|_1 \quad \text{s.t.} \quad y_{\min} \le A^T w \le y_{\max} \qquad (3)$$

where $\|\cdot\|_2$ is the Frobenius ($\ell_2$) norm and $\|\cdot\|_1$ is the $\ell_1$ norm, also called the lasso penalty. We call the above optimization problem 
constrained Lasso (C-Lasso) and we used ifTTIl to find the solution of this optimization problem. Note that the proposed 
C-Lasso approach performs better than Lasso in our application. 
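
To make the C-Lasso formulation concrete, the sketch below evaluates the objective and checks the box constraint on the predictions for a candidate weight vector. It is not a solver and not the implementation used in our experiments; the function names, the toy data and the value of the regularization weight are illustrative assumptions.

```python
def predict(features, w):
    """Return A^T w: one predicted CTR per creative.
    `features[k]` holds the feature vector a_k (column k of A)."""
    return [sum(a_j * w_j for a_j, w_j in zip(a_k, w)) for a_k in features]

def c_lasso_objective(features, y, w, lam):
    """Squared residual norm plus lam times the l1 norm of w."""
    residual = sum((p - t) ** 2 for p, t in zip(predict(features, w), y))
    return residual + lam * sum(abs(w_j) for w_j in w)

def is_feasible(features, w, y_min, y_max, tol=1e-12):
    """Check y_min <= A^T w <= y_max component-wise."""
    return all(y_min - tol <= p <= y_max + tol for p in predict(features, w))

# Toy problem with d = 2 features and n = 3 creatives (hypothetical values).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [0.002, 0.004, 0.006]
w = [0.002, 0.004]
obj = c_lasso_objective(features, y, w, lam=0.1)
```

The constraint is what distinguishes this from ordinary Lasso: a weight vector is admissible only if every predicted CTR stays inside the range observed in the serving log.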

Figure 5: The amount of preserved ranking for each method. 



Table 1: The prediction accuracy of each method relative to the Random policy (MSE of Random divided by MSE of the method). 

Data set        Samples   CM     LR     C-Lasso   SVR 
ID2             6272      1.71   2.28   2.22      3.27 
ID6             3888      1.75   2.27   2.14      2.77 
ID2-Dating      927       1.79   2.65   2.58      2.79 
ID2-Traveling   599       1.68   2.13   2.03      2.26 


4.3 Evaluation 

In this section we present different evaluation methods to analyze the efficacy of the developed visual features in 
predicting the performance of creatives. 



4.3.1 CTR Prediction 

To evaluate the CTR prediction accuracy of the algorithms, we perform 200 independent runs of each algorithm; in 
each run, 80% of each data set is selected randomly for training and the remaining 20% is used for testing. Accuracy 
is reported on the predictions for the test data. Mean Squared Error (MSE) is used to measure the prediction accuracy 
of each algorithm as follows: 

$$\mathrm{MSE} = \frac{1}{n} \sum_{k=1}^{n} |y_k - \hat{y}_k|^2 \qquad (4)$$

where $n$ is the number of test samples, $y_k$ is the true CTR of the $k$-th creative calculated from the history log, and $\hat{y}_k$ 
is the predicted CTR. To interpret the MSE values meaningfully, we introduce two baseline approaches, the Random and 
Constant Mean (CM) policies. 

The Random policy simply samples from the CTR distribution of the training data to predict the CTR of each 
test creative, while the CM policy assigns a constant value, $c_m$, to all ads, where $c_m$ is the mean CTR of the training 
data. Table 1 shows the average results over 200 independent runs for each algorithm. Each entry is the MSE 
of the Random policy divided by the MSE of the corresponding algorithm. The results show that we can perform up to 3.27 times 
better than the Random policy when predicting CTR from visual features only. All learners also perform consistently better 
than the CM baseline. This result demonstrates the non-trivial impact of the visual appearance of a creative on its 
advertising performance. 
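
The evaluation protocol for the two baselines can be sketched as below. The CTR values are synthetic (uniformly distributed, an assumption made purely for illustration), a single 80/20 split is shown instead of 200 runs, and the function names are hypothetical.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between true and predicted CTRs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def random_policy(train_ctr, n_test, rng):
    """Predict each test CTR by sampling from the empirical training CTRs."""
    return [rng.choice(train_ctr) for _ in range(n_test)]

def constant_mean_policy(train_ctr, n_test):
    """Predict the training mean CTR c_m for every test creative."""
    c_m = sum(train_ctr) / len(train_ctr)
    return [c_m] * n_test

rng = random.Random(0)
ctrs = [rng.uniform(0.0, 0.01) for _ in range(1000)]  # synthetic CTRs
rng.shuffle(ctrs)
split = int(0.8 * len(ctrs))
train, test = ctrs[:split], ctrs[split:]

mse_random = mse(test, random_policy(train, len(test), rng))
mse_cm = mse(test, constant_mean_policy(train, len(test)))
ratio_cm = mse_random / mse_cm  # same kind of ratio as the CM column of Table 1
```

For independent draws from one distribution, the expected squared error of the Random policy is roughly twice the variance, while that of the CM policy is roughly the variance, so a ratio near 2 is the natural reference point when reading the table.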



4.3.2 CTR Ranking 

We introduce a ranking criterion to investigate how well visual features can rank the creatives by their 
CTRs. Given a test set of creatives, suppose $c_1^-, c_2^-, \ldots, c_k^-$ are the $k$ images with the lowest CTR values 
and $c_1^+, c_2^+, \ldots, c_k^+$ are the $k$ images with the highest CTR values. We therefore have $k^2$ pairs $(c_i^-, c_j^+)$ such that 
$\mathrm{ctr}(c_i^-) < \mathrm{ctr}(c_j^+)$ for $i, j \in \{1, \ldots, k\}$. We wish to know whether our prediction of CTR using visual features 
preserves the ranking of the pairs $(c_i^-, c_j^+)$. To test this, we vary $k$ as a function of the test data size. We then 
measure the percentage of pairs for which the predicted ranking of the creatives matches the ranking truly observed in the test 
data. The results over 200 independent runs are shown in Figure 5 for the different data sets. The x-axis indicates the 
value of $\delta$ such that $k = \delta n$ for $\delta = 0.02, 0.04, 0.06, \ldots, 0.50$, where $n$ is the number of creatives in the data set, and the 
y-axis represents the percentage of correctly ranked pairs. 

Figure 6: The classification accuracy for each data set. 

The results show that SVR consistently outperforms the other learners. As we increase $k$, the percentage of 
correctly ranked pairs decreases for all learning algorithms. This is expected, since distinguishing between 
creatives whose CTRs are close to the mean of the CTR distribution, using visual features only, is very difficult 
even for a human. Interestingly, the results show that using visual features alone, we can preserve more than 90% of 
the ranking for data set ID2 (for $\delta = 0.1$). This number remains high, at 75%, when we compare all top-half images 
against all bottom-half images in every data set ($\delta = 0.5$). This is an encouraging result that demonstrates the utility of visual 
features in predicting the ranking of CTR. 
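
The ranking criterion can be sketched as follows. The function name and the toy CTR values are hypothetical, and `n` is taken here to be the size of the list passed in, which is an assumption about how $k$ is derived from the fraction of creatives.

```python
def preserved_ranking(true_ctr, pred_ctr, delta):
    """Fraction of the k*k (low, high) pairs whose order is kept by the
    predictions, where k = delta * n creatives are taken from each end of
    the true CTR ordering."""
    n = len(true_ctr)
    k = max(1, int(delta * n))
    order = sorted(range(n), key=lambda i: true_ctr[i])
    low, high = order[:k], order[-k:]
    correct = sum(1 for i in low for j in high if pred_ctr[i] < pred_ctr[j])
    return correct / (k * k)

# Toy example: six creatives with true and predicted CTRs.
true_ctr = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006]
pred_ctr = [0.002, 0.001, 0.0035, 0.003, 0.006, 0.005]
score = preserved_ranking(true_ctr, pred_ctr, delta=0.5)
```

In this toy case only one of the nine pairs is inverted (the third-lowest creative is predicted above the fourth-lowest), so the preserved fraction is 8/9. Swaps within the low group or within the high group do not count against the score.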

4.3.3 CTR Classification 

Previous studies in beauty evaluation J7] Q3] [TT) mostly try to classify the images into high and low quality category 
rather than assigning scores to their beauty based on visual features. Similarly, we evaluate the performance of classifying 
the creatives into high (+1) and low (-1) CTR categories using visual features only. We use a support vector machine 
with an RBF kernel as our classifier. As in the previous section, we randomly select 80% of the data for training and 
use the rest for testing. We then train our classifier on the training creatives that belong to the top and bottom 30% in CTR. 
In effect, we disregard the 40% of the data closest to the mean CTR of the training data, $\mu_t$, to reduce the noise for the 
classifier. As in the ranking experiments, we filter our test set by focusing on the $k$ creatives with the highest CTR values 
(labeled as positive) and the $k$ creatives with the lowest CTR values (labeled as negative), where $k = \delta n$ is varied 
by changing $\delta$. We obtain the classification accuracy by comparing the predicted classes with the true classes obtained 
from the real CTR values. Figure 6 shows the average classification accuracy over 200 independent runs, where 
each run uses randomly selected training and testing data. The x-axis indicates the value of $\delta$, and the y-axis represents the 
classification accuracy for each data set at that value of $\delta$. As seen in the figure, using visual features yields a 
classification accuracy of 70% when $\delta = 0.5$. Together with the previous results on predicting and ranking CTR, these 
results show the efficacy of using visual features of creatives in predicting CTR. 
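
The labeling scheme used for training can be sketched as below. Only the selection of the extreme creatives and the accuracy computation are shown; the SVM itself is omitted, and the function names and CTR values are illustrative assumptions.

```python
def extreme_training_set(ctrs, fraction=0.3):
    """Return (index, label) pairs for the bottom `fraction` of creatives
    by CTR (label -1) and the top `fraction` (label +1); the middle of the
    distribution is discarded."""
    n = len(ctrs)
    m = round(fraction * n)
    order = sorted(range(n), key=lambda i: ctrs[i])
    return [(i, -1) for i in order[:m]] + [(i, +1) for i in order[n - m:]]

def accuracy(true_labels, predicted_labels):
    """Fraction of labels that the classifier predicted correctly."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)

# Ten hypothetical training creatives.
ctrs = [0.005, 0.001, 0.009, 0.003, 0.007, 0.002, 0.008, 0.004, 0.006, 0.010]
labeled = extreme_training_set(ctrs, fraction=0.3)
```

With ten creatives and a fraction of 0.3, three are labeled low, three are labeled high, and the four in the middle are left out of training.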

4.4 Feature Selection 

The above analysis shows that visual features are useful in predicting the performance of creatives in online advertising. 
A natural next step is to identify the visual features that have a strong impact on ad performance. Such information could 
be very useful in many areas. For example, human graphic designers may use it to guide their design 
of high-performance creatives. Smart ad systems may use it to dynamically generate creatives that are 
more appealing to online users. Ad exchange systems may use it to determine which creative will win in 
the auction marketplace for each advertising opportunity. In this section we conduct a series of experiments to select 
such important visual features. 

We first calculate the Linear Correlation (LC) and Mutual Information (MI) between each feature and CTR in each 
data set. Mutual information can capture non-linear dependence between a feature and CTR. Note that, 
to calculate the mutual information between any pair of variables $(X, Y)$, we discretized each feature and the CTR values 
into 50 equal intervals. The results are shown in Table 3. The top 5 features in each data set with the highest absolute values 
are highlighted in bold. The table shows that no feature has a high linear correlation or mutual information with CTR, 
except $f_{12}$ in data set ID6. Thus we use Forward Feature Selection (FFS) to select the top $k$ features. 
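
The discretized mutual information can be sketched as follows. Equal-width binning and base-2 logarithms are assumptions made for this illustration; the text only states that values are discretized into 50 equal intervals. The function names are hypothetical.

```python
import math
from collections import Counter

def discretize(values, bins=50):
    """Map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # The maximum value falls on the right edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(xs, ys, bins=50):
    """Mutual information (in bits) between two discretized variables."""
    bx, by = discretize(xs, bins), discretize(ys, bins)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )

# Two sanity checks on toy data: identical binary variables share one bit,
# independent binary variables share none.
same = mutual_information([0, 0, 1, 1] * 25, [0, 0, 1, 1] * 25, bins=2)
indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2)
```

Unlike linear correlation, this quantity is also positive for non-monotone relationships, which is why both measures are reported.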

Table 2: The top 10 selected clusters by FFS. 

Data Set   Selected Clusters 
ID2        $S_5$, $S_1$, $S_{19}$, $S_{17}$, $S_{13}$, $S_{15}$, $S_{20}$, $S_{10}$, $S_{11}$, $S_9$ 
ID6        $S_1$, $S_2$, $S_{20}$, $S_5$, $S_{17}$, $S_{14}$, $S_9$, $S_{13}$, $S_4$, $S_{15}$ 

Figure 7: The scatter plot of $f_1$ and $f_2$ against CTR. 

Before running FFS, we first cluster the features based on the Normalized Mutual Information (NMI) of all feature 
pairs. We discretize each feature into 50 equal intervals and calculate NMI as follows: 

NMI(X: Y) = , (5) 

where H(X) is the entropy of random variable X. Then we cluster the features using the average linkage algorithm 
EH . Two clusters are merged into one if their average NMI is at least 0.2. This results in 20 clusters for data set 
ID2 and 21 clusters for data set ID6. The resulting clusters are shown in Table 3. In the table, $S_i$ represents the set 
of features in cluster $i$. We now apply a simple change to the FFS algorithm so that it selects the top $k$ clusters rather than 
features. After a feature is selected by FFS, all the correlated features that belong to the same cluster are removed from 
the subsequent steps of FFS. The top $k = 10$ selected clusters are shown in Table 2. Note that clustering the features in the 
above manner helps select different features (or feature sets) that are less correlated with each other. For example, all 
color harmony features are in the same cluster $S_4$. Therefore, by selecting one of the features from this cluster, we 
indicate the importance of color harmony for CTR, and by removing the highly correlated features at each step of FFS, 
we obtain a set of selected features that are less correlated with each other. Below we investigate some of 
the selected clusters that are common to both data sets. 
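
The clustering and the modified FFS can be sketched as follows. The sketch takes a precomputed pairwise similarity matrix (standing in for the NMI values), merges clusters greedily by average linkage until no pair reaches the 0.2 threshold, and then runs a forward selection that retires the whole cluster of each selected feature. The function names, the toy similarity matrix and the additive scoring function are hypothetical; in practice the score would be a learner's validation performance.

```python
def average_linkage(sim, threshold=0.2):
    """Greedily merge the two clusters with the highest average pairwise
    similarity until no pair of clusters reaches `threshold`."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if avg > best:
                    best, pair = avg, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

def cluster_aware_ffs(clusters, score, k):
    """Forward selection that adds the feature giving the best score and then
    removes every feature of that feature's cluster from later steps."""
    cluster_of = {f: c for c, members in enumerate(clusters) for f in members}
    candidates = set(cluster_of)
    selected = []
    while candidates and len(selected) < k:
        best = max(sorted(candidates), key=lambda f: score(selected + [f]))
        selected.append(best)
        candidates -= set(clusters[cluster_of[best]])
    return selected

# Toy similarity matrix for four features: 0 and 1 are strongly related,
# 2 and 3 moderately related, everything else nearly independent.
sim = [
    [1.00, 0.60, 0.05, 0.05],
    [0.60, 1.00, 0.05, 0.05],
    [0.05, 0.05, 1.00, 0.30],
    [0.05, 0.05, 0.30, 1.00],
]
clusters = average_linkage(sim, threshold=0.2)

# Hypothetical additive usefulness of each feature, used as the FFS score.
weights = {0: 0.5, 1: 0.4, 2: 0.3, 3: 0.1}
selected = cluster_aware_ffs(clusters, lambda fs: sum(weights[f] for f in fs), k=2)
```

Feature 1 has the second-highest individual weight, but it is never considered in the second step because it shares a cluster with feature 0. This is the intended effect of selecting clusters rather than features.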

Table 2 shows that $S_1$ is the best feature set (or cluster) for data set ID6 and the second best set for data set ID2, 
which illustrates the importance of set $S_1$. $S_1$ consists of the gray level features $f_1$ and $f_2$ of the image. The scatter 
plot of both features in data set ID2 is shown in Figure 7 (the scatter plot in data set ID6 is similar). Figure 7 shows that 
for creatives with small values in both features, a high CTR is unlikely, and that creatives with high CTR values tend to 
have high values in these two features. This is consistent with the intuition that creatives with higher contrast should 
perform better. Note that having high values in these two features does not guarantee a high CTR value. 

$S_5$ is the best feature set for data set ID2 and the fourth for ID6. It only includes $f_{10}$, which is the number of 
connected coherent components. The scatter plots of $f_{10}$ in both data sets are shown in Figure 8. The scatter plots show 
that creatives with more than 15 connected coherent components in data set ID6 and more than 20 in data set ID2 
are unlikely to achieve a CTR higher than 0.01. In other words, this suggests that cluttered creatives containing many 
objects tend to have lower CTR. 

The number of characters, $S_{19}$ in data set ID2 and $S_{20}$ in data set ID6, is interestingly the third most important feature 
set in both data sets. Figure 9 shows the scatter plot of the number of characters in both data sets. It can be seen that 
creatives with a higher number of characters are unlikely to achieve high CTR values in either data set, once again 
suggesting that textual clutter is undesirable. 

The next selected category is $S_{17}$, which is the 4-th selected category in ID2 and the 5-th in ID6. $S_{17}$ represents the 
number of connected components in the saliency binary map, the distance between salient components, the distance of saliency 
areas from the center of the image, and the distance to the closest rule-of-thirds point. This indicates the importance of 
saliency features as well as of professional photography rules such as the rule of thirds in designing ads. Intuitively, a small number of 
salient components that are close to the center of the creative and consistent with the rule of thirds are desirable features in 
a creative. Finally, $S_{15}$, which contains features describing the number of hues and the contrast of hues in the largest 
segment of the image, is the 5-th important common category considering both data sets. Note that the scatter plots of 
the last 2 selected categories have been omitted due to space limits. In summary, our top 5 selected categories include 
features from all proposed feature categories (global, local and advanced features), indicating the importance of 
each of them in predicting the CTR of creatives. 

Figure 8: The scatter plot of $f_{10}$. 

Figure 9: The scatter plot of number of characters. 



5 Conclusion 

In this paper we investigated the relationship between the user response rate and the visual appearance of creatives in 
online display advertising. To the best of our knowledge, this is the first work in this area. We designed 43 visual features 
for our experiments and extracted them from large-scale data produced by the world's largest ad exchange 
system. We tested the utility of visual features in CTR prediction, ranking and classification. The experimental results 
demonstrate that our proposed framework consistently outperforms the baselines, indicating the efficacy of visual 
features in predicting CTR. We also performed feature selection to identify the top visual feature categories that are 
most important for increasing CTR. The findings from this work will be useful for ad selection and for developing 
visually appealing creatives with higher user response propensity in online display advertising. 